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Abstract 


Context. Separating active regions that arc quiet from potentially eruptive ones is a key issue 
in Space Weather applications. Traditional classification schemes such as Mount Wilson and 
McIntosh have been effective in relating an active region large scale magnetic configuration to 
its ability to produce eruptive events. However, their qualitative nature prevents systematic studies 
of an active region’s evolution for example. 

Aims. We introduce a new clustering of active regions that is based on the local geometry observed 
in Line of Sight magnetogram and continuum images. 

Methods. We use a reduced-dimension representation of an active region that is obtained by fac¬ 
toring the corresponding data matrix comprised of local image patches. Two factorizations can be 
compared via the definition of appropriate metrics on the resulting factors. The distances obtained 
from these metrics arc then used to cluster the active regions. 

Results. We find that these metrics result in natural clusterings of active regions. The clusterings 
arc related to large scale descriptors of an active region such as its size, its local magnetic field dis¬ 
tribution, and its complexity as measured by the Mount Wilson classification scheme. We also find 
that including data focused on the neutral line of an active region can result in an increased cor¬ 
respondence between our clustering results and other active region descriptors such as the Mount 
Wilson classifications and the R value. 

Conclusions. Matrix factorization of image patches is a promising new way of characterizing 
active regions. We provide some recommendations for which metrics, matrix factorization tech¬ 
niques, and regions of interest to use to study active regions. 

Key words. Sun - active region - sunspot - neutral line - data analysis - classification - clustering 
- image patches - Hellinger distance - Grassmannian 


1 



Moon et al: Image patch analysis of sunspots and active regions. II. Clustering 


Table 1 . Mount Wilson classification rules, number of each AR, and total number of joint patches 
or pixels per Mt. Wilson class used in this paper when using the STARA masks. 


Class 

Classification Rule 

Number of AR 

Number of Patches 

a 

A single dominant spot 

50 

13,358 

P 

A pah - of dominant spots of opposite polarity 

192 

75,463 

Py 

A /3 sunspot where a single north-south polarity 
inversion line cannot divide the two polarities 

130 

95,631 

Py8 

A /3y sunspot where umbrae of opposite polarity 
are together in a single penumbra 

52 

66,195 


1. Introduction 


1.1. Context 

Identifying properties of active regions (AR) that are necessary and sufficient for the production of 
energetic events such as solar flares is one of the key issues in space weather. The Mount Wilson 
classification (see Table 1 for a brief description of its four main classes) has been effective in 
relating a sunspot’s large scale magnetic configuration with its ability to produce flares. 

( ) pointed out the first clear connection between flare productivity and magnetic structure, 

and introduced a new magnetic classification, 8, to supplement Hale’s a , [3 and y classes ( 

, ). Several studies showed that a large proportion of all major flare events begin with a 8 

configuration (Warwick, 1966; Mayfield and Lawrence, 1985; Sammis et al., 2000). 

The categorical nature of the Mount Wilson classification, however, prevents the differentiation 
between two sunspots with the same classification and makes the study of an AR’s evolution cum¬ 
bersome. Moreover, the Mount Wilson classification is generally carried out manually which results 
in human bias. Several papers (Stenning et al., 2013; Colak and Qahwaji, 2008, 2009) have used 
supervised techniques to reproduce the Mount Wilson and other schemes which has resulted in a 
reduction in human bias. 

To go beyond categorical classification in the flare prediction problem, the last decade has seen 
many efforts in describing the photospheric magnetic configuration in more details. Typically, a 
set of scalar properties is derived from line of sight (LOS) or vector magnetogram and analyzed 
in a supervised classification context to derive which combination of properties is predictive of 
increased flaring activity (Leka and Bames, 2004; Guo et al., 2006; Schrijver, 2007; Barnes et al., 
2007; Georgoulis and Rust, 2007; Falconer et al., 2008; Song et al., 2009; Huang et al., 2010; Yu 
et al., 2010; Ahmed et al., 2011; Lee et al., 2012; Bobra and Couvidat, 2015). Examples of scalar 
properties include: sunspot area, total unsigned magnetic flux, flux imbalance, neutral line length, 
maximum gradients along the neutral line, or other proxies for magnetic connectivity within ARs. 
These scalar properties are features that can be used as input in flare prediction. However, there is 
no guarantee that these selected features exploit the information present in the data in an optimal 
way for the flare prediction problem. 
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1.2. Contribution 

We introduce a new data-driven method to cluster ARs using information contained in magne¬ 
togram and continuum. Instead of focusing on the best set of properties that summarizes the infor¬ 
mation contained in those images, we study the natural geometry present in the data via a reduced- 
dimension representation of such images. The reduced-dimension is implemented via matrix factor¬ 
ization of an image patch representation as explained in Section 1.3. We show how this geometry 
can be used for classifying ARs in an unsupervised way, that is, without including AR labels as 
input to the analysis. 

We consider the same dataset as in ( ). It is obtained from the Michelson Doppler 

Imager (MDI) instrument ( , ) on board the SOHO Spacecraft. SOHO-MDI pro¬ 

vides two to four times per day a white-light continuum image observed in the vicinity of the 
Ni i 676.7L nm photospheric absorption line. MDI LOS magnetograms are recorded with a higher 
nominal cadence of 96 minutes. We selected 424 sunspot images within the time range of 1996- 
2010. They span the various Mount Wilson classifications (see Table 1), are located within 30° 
of central meridian, and have corresponding observations in both MDI continuum and MDI LOS 
magnetogram. We use level 1.8 data for both modalities. 

Our method can be adapted to any definition of the support of an AR, or Region of Interest (ROI), 
and such ROI must be given a priori. We consider three types of ROIs: 

1. Umbrae and penumbrae masks obtained with the Sunspot Tracking and Recognition Algorithm 

(STARA) (Watson , ) from continuum images. These sunspot masks encompass the 

regions of highest variation observed in both continuum and magnetogram images, and hence are 
used primarily to illustrate our method. Figure 1 provides some examples of AR images overlaid 
with their respective STARA masks. 

2. The neutral line region, defined as the set of pixels situated no more than 10 pixels (20 arc- 

sec) away from the neutral line, and located within the Solar Monitor Active Region Tracker 
(SMART) masks ( , ), which defines magnetic AR boundaries. 

3. The set of pixels that are used as support for the computation of the R -value defined in 

( ). The irivalue measures a weighted absolute magnetic flux, where the weights are positive 

only around the neutral line. 

Our patch-based matrix factorization method investigates the fine scale structures encoded by 
localized gradients of various directions and amplitudes, or locally smooth areas for example. In 
contrast, the Mount Wilson classification encodes the relative locations and sizes of concentrations 
of opposite polarity magnetic flux on a large scale. Although both classification schemes rely on 
completely different methods, using the first ROI defined above, we find some similarities (see 
Section 5). Moreover the Mount Wilson classification can guide us in the interpretation of the 
results and clusters obtained. 

The shape of the neutral line separating the two main polarities in an AR is a key element in the 
Mount Wilson classification scheme, and the magnetic field gradients observed along the neutral 
line are important information in the quest for solar activity prediction ( er, ; 

, ). We therefore analyze the effect of including the neutral line region in Section 5. 

Results based on the third ROI are compared directly to the /rivaluc. The various comparisons 
enable us to evaluate the potential of our method for flare prediction. 
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Figure 1 . MDI continuum and magnetogram from NOAA 9097 on July 23, 2000 (top) and from 
NOAA 10045 on July 25, 2002 (bottom) overlaid with the corresponding STARA masks in green. 


1.3. Reduced dimension via matrix factorization 

Our data-driven method is based on a reduced-dimension representation of an AR ROI via matrix 
factorization of image patches. Matrix factorization is a widely used tool to reveal patterns in high 
dimensional datasets. Applications outside of solar physics are numerous and range e.g. from mul¬ 
timedia activity correlation, neuroscience, gene expression (Bazot et ah, 2013), to hyperspectral 
imaging (Mittelman et al., 2012). 
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The idea is to express a /c-multi variate observation Z\ as a linear combination of a reduced num¬ 
ber of r < k components a 7 , each weighted by some (possibly random) coefficients hjy. 

r 

Zi = ^ Zjhjj + ni, (1) 

7=1 

where represents residual noise. With Z = [zi,, z n \, the equivalent matrix factorization repre¬ 
sentation is written as 

Z = AH + N (2) 

where Z is a kxn data matrix containing n observations of k different variables, A is the k x r 
matrix containing the ‘dictionary elements’ (called ’factor loadings’ in some applications) and H is 
the rxn matrix of coefficients (or ‘factor scores’). The kxn matrix N contains residuals from the 
matrix factorization model fitting. Finding A and H from the knowledge of Z alone is a severely 
ill-posed problem, hence prior knowledge is needed to constrain the solution to be unique. 

Principal Component Analysis (PCA) ( , ) is probably the most widely used dimen¬ 

sionality reduction technique. It seeks principal directions that capture the highest variance in the 
data under the constraints that these directions are mutually orthogonal, thereby defining a subspace 
of the initial space that exhibit information rather than noise. The PCA solution can be written as a 
matrix factorization thanks to the Singular Value Decomposition (SVD) ( lg, 2 ), 

and so we use SVD in the clustering method presented here. 

The Nonnegative Matrix Factorization (NMF) ( , ) is also considered in this 

paper. Instead of imposing orthogonality, it constrains elements of matrices A and H to be nonneg¬ 
ative. We further impose that each column of H has elements that sum up to one, thereby effectively 
using a formulation identical to one used in hyperspectral unmixing ( , ). 

Unmixing techniques exploit the high redundancy observed in similar bandpasses. They aim at 
separating the various contributions and at estimating a smaller set of less dependent source images. 
Matrix factorization, known as ‘blind source separation’ in this context, has many applications, 
ranging from biomedical imaging, chemometrics, to remote sensing ( , ), 

and recently to the extraction of salient morphological features from multi-wavelength extreme 
ultraviolet solar images ( , ). 

In this paper, we wish to factorize a kxn data matrix Z containing n observations of k different 
variables as in Equation (2) where the dictionary matrix A spans a subspace of the initial space, 
with r < k. We consider the cases where Z is formed from a single image as well as from multiple 
images. 

When a single image is used, the data matrix Z is built from a n pixel image by taking overlapping 
m x m -pixel neighborhoods called patches. Figure 2 (left) presents such a patch and its column 
representation. The k rows of the z-th column of Z are thus given by the nr pixel values in the 
neighborhood of pixel i. The right plot in Figure 2 provides the number of patches in each pair of 
AR images when using the STARA masks. When multiple images are used, such as when analyzing 
collectively all images from a given Mount Wilson class, the patches are combined into a single data 
matrix. Table 1 gives the total number of patches from each Mount Wilson class when using the 
STARA masks. 

A factorization of a data matrix containing image patches is illustrated in Figure 3. In this figure, 
let zi be the first column of Z, containing the intensity values for the first patch. These intensity 
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Figure 2. (Left) An example of a 3 x 3 pixel neighborhood or patch extracted from the edge of 
a sunspot in a continuum image and its column representation. (Right) The number of patches 
extracted from each pair of AR images when using the STARA masks. 



Figure 3. An example of linear dimensionality reduction where the data matrix of AR image patches 
Z is factored as a product of a dictionary A of representative elements and the corresponding co¬ 
efficients in the matrix H. The A matrix consists of the basic building blocks for the data matrix Z 
andH contains the corresponding coefficients. 


values are decomposed as a sum of r elements as in Equation (1) where a j is the /—th column of 
A and hj\ is the (/, l)-th element of H. In this representation, the vectors a j,j = 1,..., r are the 
elementary building blocks common to all patches, whereas the hj\ are the coefficients specific to 
the first patch. 

To compare ARs and cluster them based on this reduced dimension representation, some form 
of distance is required. To measure the distance between two ARs, we apply some metrics to the 
corresponding matrices A or H obtained from the factorizations of the two ARs. These distances 
are further introduced into a clustering algorithm that groups ARs based on the similarity of their 
patch geometry. 

This paper builds upon results obtained in ( ), which can be summarized as 

follows: 
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1. Continuum and magnetogram modalities are correlated, and there may be some advantage in 
considering both of them in an analysis. 

2. A patch size of m = 3 includes a significant portion of spatial correlations present in continuum 
and magnetogram images. 

3. Linear methods for dimensionality reduction (e.g. matrix factorization) are appropriate to analyze 
ARs observed with continuum images and magnetogram. 

With an AR area equal to n pixels and an m x m patch, the corresponding continuum data matrix X 

and magnetogram data matrix Y each have size m 2 xn. The full data matrix considered is Z = 

with size 2 m 2 x n = 18 x n in our case. The images extracted from both modalities are normalized 
prior to analysis. An intrinsic dimension analysis in ( ) showed further that a sunspot 

can be represented accurately with a dictionary containing six elements. 

1.4. Outline 

Section 2 describes two matrix factorization methods: the singular value decomposition (SVD) and 
nonnegative matrix factorization (NMF). While more sophisticated methods exist that may lead to 
improved performance, we focus on SVD and NMF to demonstrate the utility of an analysis of 
a reduced dimension representation of image patches for this problem. Future work will include 
further refinement in the choice of matrix factorization techniques. To compare the results from this 
factorization we need a metric, and so we use the Hellinger distance for this purpose. To obtain 
some insight on how these factorizations separate the data, we make some general comparisons in 
Section 3. In particular, with the defined metric, we compute the pairwise distances between Mount 
Wilson classes to identify which classes are most similar or dissimilar according to the matrix 
factorization results. 

Section 4 describes the clustering procedures that take the metrics’ output as input. The method 
called ‘Evidence Accumulating Clustering with Dual rooted Prim tree Cuts’ (EAC-DC) was in¬ 
troduced by ( ) and is used to cluster the ARs. By combining the two matrix 

factorization methods, a total of two procedures are used to analyze the data. Besides analyzing 
the whole sunspot data, we also look at information contained in patches situated along the neutral 
lines. The results of the clustering analyses are provided in Section 5. 

This paper improves and extends the work in (2014), where fixed size square pixel 

regions centered on the sunspot group were used as the ROI for the matrix factorization prior to 
clustering. Moreover, here we are using more appropriate metrics to compare the factorization re¬ 
sults. 

2. Matrix Factorization 

The intrinsic dimension analysis in ( ) showed that linear methods (e.g. matrix 

factorization) are sufficient to represent the data, and hence we focus on those. Matrix factorization 
methods aim at finding a set of basis vectors or dictionary elements such that each data point (in our 
case, pair of pixel patches) can be accurately expressed as a linear combination of the dictionary 
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elements. Mathematically, if we use m x m patches then this can be expressed as Z « AH, where Z 
is the 2 m 2 x n data matrix with n data points being considered, A is the 2 m 1 x r dictionary with the 
columns corresponding to the dictionary elements, and H is the rxn matrix of coefficients. The goal 
is to find matrices A and H whose product nearly approximates Z. The degree of approximation 
is typically measured by the squared error ||Z - AH|||, where || • ||r denotes the Frobenius norm 
( , ). Additional assumptions on the structure of the matrices A and H can be 

applied in matrix factorization depending on the application. Examples include assumptions of 
orthonormality of the columns of the dictionary A, sparsity of the coefficient matrix H ( 

, ), and nonnegativity on A and H ( , ). 

We consider two popular matrix factorization methods: the singular value decomposition (SVD) 
and nonnegative matrix factorization (NMF). 

2.1. Factorization using SVD 

To perform matrix factorization using SVD, we take the singular value decomposition of the data 
matrix Z = ULV r where U is the matrix of the left singular vectors, £ is a diagonal matrix contain¬ 
ing the singular values, and V is a matrix of the right singular vectors. If the size of the dictionary r 
is fixed and is less than 2m 2 , then the matrix of rank r that is closest to Z in terms of the Frobenius 
norm is the matrix product U,Z ; V ; r , where U, and V, are matrices containing only the first r sin¬ 
gular vectors and E, contains only the first r singular values ( g, 2 ). Thus for 

SVD, the dictionary and coefficient matrices are A = U r and H = £ r Vj, respectively. Note that 
SVD enforces orthonormality on the columns of U r . Further details are included in Appendix A.l. 

The intrinsic dimension estimated in ( ) determines the number of parameters 

required to accurately represent the data. It is used to provide an initial estimate for the size of 
the dictionaries r. For SVD, we choose r to be one standard deviation above the mean intrinsic 
dimension estimate, that is, r ^ 5 or 6. The choice of r is then further refined by a comparison of 
dictionaries in Section 3. See Section 4.2 for more on selecting the dictionary size. 

Figure 4 shows the learned dictionaries using SVD on the entire data set of 424 image pairs of 
ARs. Interestingly, the SVD seems to consider the continuum and magnetogram separately as the 
magnetogram elements are essentially zero when the continuum elements are not and vice versa. 
This is likely caused by the orthonormality constraint. The dictionary patches largely consist of 
a mix of uniform patches and patches with gradients in varied directions. The second dictionary 
element is associated with the average magnetic field value of a patch. 

2.2. Factorization using NMF 

Non-negative matrix factorization (NMF) ( , ) solves the problem of minimizing 

]|Z - AH||| while constraining A and H to have nonnegative values. Thus NMF is a good choice 
for matrix factorization when the data is nonnegative. For our problem, the continuum data is non¬ 
negative while the magnetogram data is not. Therefore we use a modified version of NMF using 
projected gradient where we only constrain the parts of A corresponding to the continuum to be 
nonnegative. An effect of using this modified version of NMF is that since the coefficient matrix H 
is still constrained to be nonnegative, we require separate dictionary elements that are either positive 
or negative in the magnetogram component. Thus we use approximately 1.5 times more dictionary 
elements for NMF than SVD. 
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Figure 4. Learned dictionary elements using SVD. Dictionary elements are constrained to be or¬ 
thonormal. The patches consist of uniform patches and gradients in varied directions. The magne¬ 
togram patches are essentially zero when the continuum components are nonzero and vice versa. 
The dictionary size r is first chosen based on the intrinsic dimension estimates in ( ) 

and then refined by comparing dictionaries of various sizes in Section 3. Section 4.2 contains more 
details on choosing r. 
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Figure 5. Learned dictionary elements using NMF where the continuum dictionary elements are 
constrained to be nonnegative. All the dictionary patches consist of uniform patches or gradients in 
varied directions. The order of the elements is not significant. The dictionary size r is chosen to be 
approximately 1.5 times larger than the SVD dictionary size, which is chosen based on the intrinsic 
dimension estimates in ( ) and then refined using the results in Section 3. 


Since we apply NMF to the full data matrix Z, this enforces a coupling between the two modal¬ 
ities by forcing the use of the same coefficient matrix to reconstruct the matrices X and Y. This is 
similar to coupled NMF which has been used in applications such as hyperspectral and multispectral 
data fusion ( ,2 ). 

Figure 5 shows the learned dictionary elements using NMF on the entire dataset. For NMF, the 
modalities are not treated separately as in the SVD results. But as for SVD, the patches largely 
consist of a mix of uniform patches and patches with gradients in varied directions. 

2.3. SVD vs. NMF 

There are advantages to both SVD and NMF matrix factorization methods which are summarized 
in Table 2. SVD produces the optimal rank r approximation of Z, is fast and unique, and results 
in orthogonal elements ( , ). NMF has the advantages of nonnegativity, spar¬ 

sity, and interpretability. The interpretability comes from the additive parts-based representation 
inherent in NMF ( , ). In contrast, the SVD results are not sparse which can 

make interpretation more difficult. However, NMF is not as robust as SVD as the NMF algorithm 
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Table 2. Summary of the advantages of SVD and NMF matrix factorization methods. The ad¬ 
vantages of one method complement the disadvantages of the other ( , ). For 

example, the NMF optimization problem is nonconvex with local minima resulting in solutions that 
depend on the initialization of the algorithm. 


SVD Advantages 

NMF Advantages 

Optimal rank r approximation 

Results are nonnegative 

Fast to compute 

Results are sparse 

Unique 

Sparsity and nonnegativity lead to improved interpretability 


is a numerical approximation to a non-convex optimization problem having local minima. Thus 
the solution provided by the NMF algorithm depends on the initialization. More details on matrix 
factorization using NMF and SVD are included in Appendix A.l. 

2.4. Methods for Comparing Matrix Factorization Results 

To compare the results from matrix factorization, we primarily seek a difference between the coef¬ 
ficients from H. To aid us in choosing a dictionary size r, we also require a measure of difference 
between dictionaries A. We use the Hellinger distance and Grassmannian projection metric to mea¬ 
sure the respective differences. 

In ( ), the Frobenius norm was used to compare the dictionaries. However, this 

fails to take into account the fact that two dictionaries may have the same elements but in a different 
order. In this case, the Frobenius norm of the difference between two dictionaries may be high 
even though the dictionaries span the same subspace. A better way to measure the difference would 
be to compare the subspaces spanned by the dictionaries. The Grassmannian Gr(r, V ) is a space 
which parameterizes all linear subspaces with dimension r of a vector space V. As an example, 
the Grassmannian Gr(2,R") is the space of planes through the origin of the standard Euclidean 
vector space in R". In our case, we are concerned with the Grassmannian Gr fr,R 18 J, where r is 
the size of the dictionary. The space spanned by a given dictionary A is then a single point in 
Gr (V, R 18 ). Several metrics have been defined on this space including the Grassmannian projection 
metric ( , ; , 1998). It can be defined as 

<^g(A, A') = ||Pa _ Pa'II » 

where Pa = A (A 7 A) A t is the projection matrix of A and ]|-|| is the f' 2 norm. This metric is invari¬ 
ant to the order of the dictionary elements and compares the subspaces spanned by the dictionaries. 
This metric has a maximum value of 1. 

To compare the coefficient matrices, we assume that the 18-dimensional pixel patches within an 
AR are samples from an 18-dimensional probability distribution, and that each AR has a corre¬ 
sponding (potentially unique) probability distribution of pixel patches. We project these samples 
onto a lower dimensional space by matrix factorization. In other words, we leam a dictionary A and 
the coefficient matrix H for the entire dataset Z, and then separate the coefficients in H according to 
the K different ARs (or groups of ARs) considered: Z = A (IG H 2 ... H A j. The columns of H, are 
a collection of projected, low-dimensionality, samples from the zth AR (or group), and we let f de¬ 
note the corresponding probability density function. Given two such collections, we can estimate the 
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difference between their probability distributions by estimating the information divergence. Many 
kinds of divergences exist such as the popular Kullback-Leibler divergence ( , 

1951). We use the Hellinger distance which is defined as (Hellinger, 1909; Bhattacharyya, 1946; 

Csiszar, 1967) 

H 2 (f, fj ) = 1 - J' y]fi(x)fj(x)dx, 

where f and fj are the two probability densities being compared. The Hellinger distance has the 
advantage over other divergences of being a metric which is not true of divergences in general. To 
estimate the Hellinger distance, we use the nonparametric estimator derived in 
( , ) that is based on the /.'-nearest neighbor density estimators for the densities f and f j. This 

estimator is simple to implement and achieves the parametric convergence rate when the densities 
are sufficiently smooth. 

3. Comparisons of General Matrix Factorization Results 

We apply the metrics mentioned in Section 2.4 and compare the local features as extracted by 
matrix factorization per Mount Wilson class. One motivation for these comparisons is to investigate 
differences between the Mount Wilson classes based on the Hellinger distance. Another motivation 
is to further refine our choice of dictionary size r in preparation for clustering the ARs. When 
comparing the dictionary coefficients using the Hellinger distance, we want a single, representative 
dictionary that is able to accurately reconstruct all of the images. Then the ARs will be differentiated 
based on their respective distributions of dictionary coefficients instead of the accuracy of their 
reconstructions. The coefficient distributions can then be compared to interpret the clustering results 
as is done in Section 5.1. 

Recall that our goal is to use unsupervised methods to separate the data based on the natural 
geometry. Our goal is not to replicate the Mount Wilson results. Instead we use the Mount Wilson 
labels in this section as a vehicle for interpreting the results. 

3.1. Grassmannian Metric Comparisons 

We first learn dictionaries for each of the Mount Wilson types by applying matrix factorization to 
a subset of the patches corresponding to sunspot groups of the respective type. We then use the 
Grassmannian metric to compare the dictionaries. For example, if we want to compare the a and /3 
groups, we collect a subset of patches from all ARs designated as a groups into a single data matrix 
Z a . We then factor this matrix with the chosen method to obtain Z„ = A Q H C1 . Similarly, we obtain 
Zg = AyjHg and then calculate d G (A a ,Af). 

The reason we use only a subset of patches is that each AR type has a different number of total 
patches (see Table 1) which may introduce bias in the comparisons. One source of potential bias 
in this case is due to the potentially increased patch variability in groups with more patches, which 
would result in increased difficulty in characterizing certain homogeneities of the patch features. 
This is mitigated somewhat by the fact that the local intrinsic dimension is typically less than 6 
( [., ). However, it is possible that there may be different local subspaces with the same 

dimension. A second source of potential bias is in the different levels of variance of the estimates 
due to difference in patch numbers. To circumvent these potential biases, we use the same number 
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Table 3. Difference between dictionaries learned from the collection of sunspot patches corre¬ 
sponding to the different Mount Wilson types as measured by the Grassmannian metric d G , e.g. 
d G { A ff , Ag). Dictionaries are learned using random subsets of the data and the results are reported 
in the form of mean+standard deviation using 100 trials. Different sizes of dictionaries r are used. 
The SVD results are sensitive to r. 


SVD, Pooled Grassmannian, r = 5 



a 

(3 

Py 

py$ 

a 

0.15 ±0.10 

0.26 ±0.18 

0.89 ± 0.06 

0.34 ±0.18 

P 


0.50 ± 0.29 

0.89 ±0.14 

0.43 ± 0.27 

Py 



0.24 ±0.16 

0.7 ± 0.2 

/3y5 




0.45 ± 0.28 


SVD, Pooled Grassmannian, r = 6 



a 

P 

Py 

Py5 

a 

0.02 ± 0.004 

0.03 ± 0.003 

0.02 ± 0.004 

0.04 ± 0.005 

P 


0.02 ± 0.005 

0.02 ± 0.004 

0.03 ± 0.006 

Py 



0.03 ± 0.006 

0.03 ± 0.006 

Py5 




0.03 ± 0.007 


NMF, Pooled Grassmannian, r = 8 



a 

P 

Py 

py5 

a 

0.40 ±0.13 

0.40 ±0.10 

0.33 ± 0.09 

0.37 ±0.10 

P 


0.29 ±0.13 

0.35 ± 0.09 

0.37 ±0.11 

Py 



0.37 ±0.12 

0.34 ±0.10 

PyS 




0.41 ±0.11 


NMF, Pooled Grassmannian, r 

= 9 


a 

P 

Py 

Py6 

a 

0.62 ± 0.25 

0.41 ±0.15 

0.45 ±0.19 

0.40 ±0.13 

P 


0.54 ± 0.23 

0.49 ±0.19 

0.44 ±0.19 

Py 



0.53 ±0.23 

0.44 ± 0.20 

Py6 




0.49 ± 0.20 


of patches in each group for each comparison. For the inter-class comparison, we randomly take 
13,358 patches (the number of patches in the smallest class) from each class to learn the dictionary, 
and then calculate the Grassmannian metric. For the intra-class comparison, we take two disjoint 
subsets of 6,679 patches (half the number of patches in the smallest class) from each class to leam 
the dictionaries. This process is repeated 100 times and the resulting mean and standard deviation 
are reported. 

Table 3 shows the corresponding average Grassmannian distance metrics when using SVD and 
NMF and for different sizes of dictionaries r. For SVD, the results are very sensitive to r. Choosing 
r = 5 results in large differences between the different dictionaries but for r = 6, the dictionaries 
are very similar. This suggests that for SVD, 6 principal components are sufficient to accurately 
represent the subspace upon which the sunspot patches lie. This is consistent with the results of 
( ) where the intrinsic dimension is found to be less than 6 for most patches. 
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Interestingly, for the r = 5 SVD results, the f3y group is the most dissimilar to the other groups 
while being relatively similar to itself. In contrast, the /i group is fairly dissimilar to itself and 
relatively similar to the a and (3y8 groups. 

The NMF results are less sensitive to r. The average difference between the dictionaries and its 
standard deviation is larger when r - 9 compared to when r = 8. However, for a given r, all of 
the mean differences are within a standard deviation of each other. Thus on aggregate, the NMF 
dictionaries learned from large collections of patches from multiple images differ from each other 
to the same degree regardless of the Mount Wilson type. 


3.2. Hellinger Distance Comparisons 

For the Hellinger distance, we learn a dictionary A and the coefficient matrix H for the entire data 
set Z. We then separate the coefficients in H according to the Mount Wilson type and compare the 
coefficients’ distributions using the Hellinger distance. For example, suppose that the data matrix 
is arranged as Z = ( Z a Zp Zp y Zp y6 ). This is factored as Z = A ( H„ H,, Hg r J. To compare 
the a and f3 groups, we assume that the columns in H (1 are samples from the distribution f a and 
similarly Hg contains samples from the distribution fp. We then estimate the Hellinger distance 
H(f a , fp) using the divergence estimator in ( ). 

When the Hellinger distance is used to compare the collections of dictionary coefficients within 
the sunspots, the groups are very similar, especially when using SVD (Table 4). This indicates that 
when the coefficients of all ARs from one class are grouped together, the distribution looks similar 
to the distribution of the other classes. However, there are some small differences. First the intr¬ 
aclass distances are often much smaller than the interclass distances which indicates that there is 
some relative difference between most classes. Second, for both matrix factorization methods, the 
(3y5 groups are the most dissimilar. This could be due to the presence of a 5 spot configuration, 
where umbrae of opposite polarities are within a single penumbra. Such a configuration may re¬ 
quire specific linear combinations of the dictionary elements as compared to the other classes. The 
presence and absence of these linear combinations in two Mount Wilson types would result in a 
higher Hellinger distance between them. 

Again, for clustering, we compute the pairwise Hellinger distance between each AR’s collection 
of coefficients. This is done by forming the data matrix from the 424 ARs as Z = ( Z\ Z 2 ... Z 42 4 ) 

and factoring it as Z = A ( Hj H 2 ... H 424 j. The columns of H, are samples from a distribution f 
and the distributions f and fj are compared by estimating H(f, fj). The corresponding dictionaries 
for the two methods are shown in Figures 4 and 5. 

Table 5 gives the average pairwise Hellinger distance between the ARs. The average distances 
differ more with the NMF based coefficients resulting in larger dissimilarity. The average distance 
is smallest when comparing the /3 groups to all others and largest when comparing the / 3y groups to 
the rest. The standard deviation is also larger when comparing a and /3 groups. This may be partially 
related to the variability in estimation due to smaller sample sizes as the a and /3 groups contain 
more of the smaller ARs (see Figure 2). Overall, the average distances show that there are clear 
differences between the ARs within the sunspots using this metric. 


13 


Moon et al: Image patch analysis of sunspots and active regions. II. Clustering 


Table 4. Difference between the collection of dictionary coefficients pooled from the different 
Mount Wilson classes as measured by the Hellinger distance. Intraclass distances are reported in 
the form of mean+standard deviation and are calculated by randomly splitting the data and then 
calculating the distance over 100 trials. The size of the dictionaries is r = 6 and 8 for SVD and 
NMF, respectively. The /3yd group is most dissimilar to the others. 


SVD, Pooled Hellinger 



a 

/3 

f3y 

/3yd 

a 

0.0006 ± 0.004 

0 

0 

0.03 

f3 


0.0005 ± 0.002 

0.01 

0.08 

(3y 



0.0003 ± 0.002 

0.05 

/3y6 




0.0004 ± 0.002 


NMF, Pooled Hellinger 



a 

P 

/3y 

(3y6 

a 

0±0 

0.08 

0.05 

0.10 

[3 


0.00007 ± 0.0004 

0.03 

0.12 

/3y 



0.000002 ± 0.00003 

0.11 

/3y6 




0.00001 ± 0.0002 


Table 5. Average pairwise difference between dictionary coefficients from each AR from different 
Mount Wilson types as measured by the Hellinger distance. Results are reported in the form of 
mean+standard deviation. The size of the dictionaries is r = 6 and 8 for SVD and NMF, respectively. 
The f3y ARs are most dissimilar to each other and the other classes while the (3 ARs are most similar. 


SVD, Average Hellinger 



4. Clustering of Active Regions: Methods 

4.1. Clustering Algorithm 

The clustering algorithm we use is the Evidence Accumulating Clustering with Dual rooted Prim 
tree Cuts (EAC-DC) method in ( ) which scales well for clustering in high 

dimensions. EAC-DC clusters the data by defining a metric based on the growth of two minimal 
spanning trees (MST) grown sequentially from a pair of points. To grow the MSTs, a base dissimi¬ 
larity measure is required as input such as the Hellinger distance described in Section 2.4. From the 
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Table 6. Summary of the dissimilarity and similarity measures used. 


Measure 

Type 

Properties 

Grassmannian metric 

Dissimilarity 

Compares dictionaries by comparing the subspace 
spanned by the dictionary elements 

Hellinger distance 

Dissimilarity 

Compares the underlying distributions of dictionary 
coefficients; estimated using Moon and Hero III (2014a) 

EAC-DC based measure 

Similarity 

Based on sequentially grown MSTs of the data; 
requires a base dissimilarity measure as input 


new metric defined using the MSTs, a similarity measure between inputs is created. It is fed into a 
spectral clustering algorithm that groups together inputs which are most similar. The similarity mea¬ 
sure based on the MSTs adapts to the geometry of the data, and this results in a clustering method 
that is robust and competitive with other algorithms ( , ). See Appendix A.2 for 

more details. 

4.2. Clustering Input: Dictionary Sizes 

As input to the clustering algorithm, we use the matrix factorization results as described in 
Section 2.4. We learn a single dictionary from the entire dataset. We then project the data onto 
a lower dimensional space, i.e. we learn the coefficient matrices H,. These matrices are the inputs 
in this method and the base dissimilarity measure is the Hellinger distance estimated using each 
AR’s respective coefficients. Table 6 provides a summary of the various dissimilarity and similarity 
measures that we use. 

As mentioned in Section 2, the estimated intrinsic dimension from ( ) is used to 

provide an initial estimate for the size of the dictionaries r. The choice of r is further refined by the 
dictionary comparison results from Section 3 . For SVD, we choose r to be one standard deviation 
above the mean intrinsic dimension estimate which is approximately 5 or 6. When comparing the 
dictionary coefficients, we want the single dictionary to accurately represent the images. The dic¬ 
tionary should not be too large as this may add spurious dictionary elements due to the noise. The 
results in Table 3 suggest that for SVD, the dictionaries are essentially identical for r - 6. This 
means that 6 dictionary elements are sufficient to accurately reconstruct most of the images. This 
is consistent with the intrinsic dimension estimates in ( ). Thus we choose r - 6 

when using the Hellinger distance for the SVD dictionary coefficients. 

Since our mixed version of NMF requires approximately 1.5 times the number of dictionary 
elements as SVD (see Section 2.1), we choose r = 8 within the sunspots. Since the differences 
between classes were similar for r - 8 and r - 9, choosing r - 8 strikes a balance between accurate 
representation of the data and limiting the effects of noise. 

4.3. Clustering Input: Patches within Sunspots and Along the Neutral Line 

Our main focus up to this point in this paper has been on data matrices Z containing the patches 
within the STARA masks, that is, within sunspots. The clustering based on these patches is dis¬ 
cussed in Sections 5.1-5.2. 
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Table 7. The labels used to compare with the clustering results when analyzing the effects of in¬ 
cluding the neutral line. 


# of clusters 

Mount Wilson Comparison 

2 

Simple (a and /j), complex (fiy and fiyS) 

3 

a, /3, and complex 

4 

Mount Wilson (a, /?, y By, and /3yS) 


It is well-known, however, that the shape of the neutral line separating the main polarities plays 
an important role in the Mount Wilson classification. For this reason, we conduct two experiments 
involving data from along the neutral line. 

The results of the first experiment are in Section 5.3 where we apply matrix factorization on a data 
matrix containing only the patches situated along the neutral line using the same ARs as in Sections 
5.1-5.2. To compute the location of a neutral line in this experiment, we assume it is situated in the 
middle between regions of opposite polarity, and proceed as follows. First, we determine regions of 
high magnetic flux of each polarity using an absolute threshold at 50 Gauss. Second, we compute 
for each pixel the distance to the closest high flux region in each polarity using the Fast Marching 
method ( in, ). Once the two distance fields (one for each polarity) are calculated, the 
neutral line can be obtained by finding the pixels that lie on or are close to the zero level set of the 
difference of these two distance fields. In this paper, we choose a maximum distance of 10 pixels to 
determine the neutral line region. 

We extract the patches that lie in the neutral line region and within the SMART mask associated 
to the AR. Call the resulting data matrix Z N . We then apply SVD or NMF matrix factorization 
as before and calculate the pairwise distance between each AR neutral line using the Hellinger 
distance on the results. Define the resulting 424 x 424 dissimilarity matrix as D v for whichever 
factorization method we are currently using. Similarly, define Z s and D$ as the respective data 
matrix and dissimilarity matrix of the data from within the sunspots using the same configuration. 
The base dissimilarity measure D inputted in the clustering algorithm is now a weighted average of 
the distances computed within the neutral line regions and within the sunspots: D = wD w +(l -w)Ds 
where 0 < w < 1. Using a variety of weights, we then compare the clustering output to different 
labeling schemes based on the Mount Wilson labels as shown in Table 7. 

For the second experiment, we perform clustering on a ROI that selects pixels along a strong 
field polarity reversal line. The high gradients near strong field polarity reversal lines in LOS mag¬ 
netograms are a proxy for the presence of near-photospheric electrical currents, and thus might 
be indicative of a non-potential configuration ( ichrijvci , ). To compute this ROI, the magne¬ 

tograms are first reprojected using an equal-area, sinusoidal re-projection that uses Singular-Value 
Padding ( , ). The latter is known to be more accurate than image interpolation on 

transformed coordinates. To conserve flux, the magnetograms are also area-normalized. 

In the reprojected magnetograms, we delimit an AR using sunspot information from the Debrecen 
catalog ( , ) to obtain the location of all the pixels that belong to the spots related 

to an AR (called "Debrecen spots" hereafter). The ROI consists of a binary array constructed as 
follows: 
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1. The pixels that belong to the Debrecen spots are assigned the scalar value 1 and all others are 
assigned the scalar value -1. 

2. From this two-valued array, we retrieve the distances from the zero level-set using a fast marching 

method ( , ) implemented in the Python SciKits’ module "sci ki t-fmm". 

3. We mask out the pixels within a distance of 80 pixels from the zero level-set (in the equal-area- 
reprojected coordinate system). 

4. The ROI is delimited by the convex hull of the resulting mask, that is, the smallest convex polygon 
that surrounds all 1-valued pixels. The resulting mask is a binary array with the pixels inside the 
convex hull set to True. 

Within the ROI, the R -value is calculated similarly to the method described in ( ). We 

dilate the image using a dilation factor of 3 pixels, and extract the flux using overlapping Gaussian 
masks with cr = 2 px. Integrating the non-zero values outputs the R-value, i.e, the total flux in the 
vicinity of the polarity-inversion line. 

We then perform an image patch analysis using image patches from this region. We do this by 
using either SVD or NMF to do dimensionality reduction on the patches, and then estimate the 
Hellinger distance between ARs using the reduced dimension representation. We exclude a groups 
from the analysis as they do not have a strong field polarity reversal line. This leaves 420 images 
to be clustered. The clustering assignments are then compared to the calculated R value via the 
correlation coefficient. The results are presented in Section 5.4. 

5. Clustering of Active Regions: Results 

Given the choices of matrix factorization techniques (NMF and SVD) we have two different cluster¬ 
ing results on the data. Section 5.1 focuses on the clusterings using data from within the sunspots, 
and Section 5.2 provides some recommendations for which metrics and matrix factorization tech¬ 
niques to use to study different ARs. The neutral line clustering results are then given in Section 5.3 
followed by the R value based experiment in Section 5.4. 

5.1. Clustering within the Sunspot 

We now present the clustering results when using the Hellinger distance as the base dissimilarity. 
The corresponding dictionary elements to the coefficients are represented in Figure 4 (for the SVD 
factorization) and in Figure 5 (for the NMF). 

The EAC-DC algorithm does not automatically choose the number of clusters. We use the mean 
silhouette heuristic to determine the most natural number of clusters as a function of the data 
( 1 ). The silhouette is a measure of how well a data point belongs to its assigned 

cluster. The heuristic chooses the number of clusters that results in the maximum mean silhouette. 
In both clustering configurations, the number of clusters that maximizes the mean silhouette is 2 so 
we focus on the two clustering case for all clustering schemes throughout. 

To compare the clustering correspondence, we use the adjusted Rand index (ARI). The ARI 
is a measure of similarity between two clusterings (or a clustering and labels) and takes values 
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Table 8. Mean similarity of ARs to other ARs either in the same cluster (1 vs. 1 or 2 vs. 2) or in the 
other cluster (1 vs. 2) under the different schemes. Cluster 1 contains ARs that are very dissimilar 
to each other while Cluster 2 contains ARs that are very similar to each other. 


Mean Similarity 



1 vs. 1 

1 vs. 2 

2 vs. 2 

SVD, Hellinger 

0.29 

0.42 

0.87 

NMF, Hellinger 

0.30 

0.42 

0.88 


between -1 and 1. A 1 indicates perfect agreement between the clusterings and a 0 corresponds 
to the agreement from a random assignment ( 1, ). Thus a positive ARI indicates that the 

clustering correspondence is better than a random clustering while a negative ARI indicates it is 
worse. The ARI between the NMF and SVD clusterings is 0.27 which indicates some overlap. 

Visualizing the clusters in lower dimensions is done with multidimensional scaling (MDS) as 
in (2014). Let S be the 424 x 424 symmetric matrix that contains the AR pair similarities 

as created by EAC-DC algorithm. MDS projects the similarity matrix S onto the eigenvectors of 
the normalized Laplacian of S ( , ). Let c, 6 R 424 be the projection onto the 

zth eigenvector of S using NMF. The first eigenvector represents the direction of highest variability 
in the matrix S, and hence a high value of the A—th element of Ci indicates that the A—th AR is 
dissimilar to other ARs. 

Figure 6 displays the scatter plot of Ci vs. c 2 (top) and Ci vs. c 3 (bottom) using NMF. Comparing 
them with the Mount Wilson classification we see a concentration of simple ARs in the region with 
highest Ci values (most dissimilar ARs), and a concentration of complex ARs in the region with 
lowest Ci (more similar ARs). We can show this more precisely by computing the mean similarity 
of the zth AR to all other ARs as the mean of the zth row (or column) of S. The value from Ci is then 
inversely related to this mean similarity as seen in Figure 7. 

The similarity defined under this clustering scheme gathers in Cluster 2 ‘similar’ AR that are 
for a large part of the type f3y and [iyd, whereas Cluster 1 contains AR that are more ‘dissimilar’ 
to each other, with a large part of a or active regions. The other clustering configuration has a 
similar relationship between the first MDS coefficient and the mean similarity. 

Table 8 makes this clearer by showing the mean similarity measure within each cluster and be¬ 
tween the two clusters, which is calculated in the following manner. Suppose that the similarity 
matrix is organized in block form where the upper left block corresponds to Cluster 1 and the lower 
right block corresponds to Cluster 2. The mean similarity of Cluster 1 is calculated by taking the 
mean of all the values in the upper left block of this reorganized similarity matrix. The mean sim¬ 
ilarity of Cluster 2 is found similarly from the lower right block and the mean similarity between 
the clusters is found from either the lower left or upper right blocks. These means show that under 
the NMF clustering scheme, ARs in Cluster 2 are very similar to each other while ARs in Cluster 
1 are not very similar to each other on average. In fact, the ARs in Cluster 1 are more similar to 
the ARs in Cluster 2 on average than to each other. The other clustering configuration has a similar 
relationship between cluster assignment and mean similarity. 

This relationship between AR complexity and clustering assignment is further noticeable in 
Figure 8 which gives a histogram of the Mount Wilson classes divided by clustering assignment. 
This figure shows clear patterns between the clusterings and Mount Wilson type distribution, where 
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Figure 6. Scatter plot of MDS variables Ci vs. c 2 (top) and Ci vs. c 3 (bottom) where c, 6 R 4 24 is 
the projection of the similarity matrix onto the zth eigenvector of the normalized Laplacian of the 
similarity matrix when using the NMF coefficients. Each point corresponds to one AR and they 
are labeled according to the clustering (left) and the Mount Wilson labels (right). In this space, the 
clusters data appear to be separable and there are concentrations of complex ARs in the region with 
lowest Ci values. Other patterns are described in the text. 


the clustering separates somewhat the complex sunspots from the simple sunspots. This suggests 
that these configurations are clustering based on some measure of AR complexity. 

The Hellinger-based clusterings are correlated with sunspot size for some of the Mount Wilson 
classes, see Table 9. Based on the mean and median number of pixels, the Hellinger distance on the 
NMF coefficients tends to gather in Cluster 2 the smallest AR from classes a, /?, and fly. Similarly, 
the Hellinger distance on the SVD coefficients separates the /3 and fly AR by size with Cluster 1 
containing the largest and Cluster 2 containing the smallest AR. 

Since the Hellinger distance calculates differences between ARs based on their respective dis¬ 
tribution of dictionary coefficients, we can examine the coefficient distribution to obtain insight 
on what features the clustering algorithm is exploiting. For simplicity, we examine the marginal 
histograms of the coefficients pooled from ARs of a given cluster. When looking at the SVD coeffi- 
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Figure 7. Mean similarity of an AR with all other ARs as a function of its MDS variable C\ using 
NMF. Cluster 2 is associated with those ARs that are most similar to all other ARs while Cluster 1 
contains those that are least similar to all others. 


SVD, Hellinger NMF, Hellinger 



Figure 8. Histograms of the Mount Wilson classes divided by clustering assignment using the 
Hellinger distance. Cluster 1 contains more of the complex ARs while Cluster 2 contains more 
of the simple ARs. 

Table 9. Mean and median number of pixels of the ARs in each cluster under the Hellinger cluster¬ 
ing schemes. Cluster 1 contains the larger sunspots for all groups when using NMF and for some of 
the groups when using SVD. 


Number of Pixels 



a 

P 

Py 

Py5 

All 

Cluster 

1 

2 

1 

2 

1 

2 

1 

2 

1 

2 

Mean, SVD Hellinger 

278 

260 

582 

270 

823 

577 

1234 

1354 

756 

422 

Mean, NMF Hellinger 

384 

183 

788 

156 

847 

515 

1418 

880 

882 

283 

Median, SVD Hellinger 

148 

128 

477 

70 

612 

472 

1012 

1172 

588 

174 

Median, NMF Hellinger 

265 

121 

580 

56 

631 

393 

1157 

665 

677 

105 
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Figure 9. Histograms of the marginal distributions of the coefficients corresponding to the mean 
magnetic field value (dictionary element 2 in Figure 4) for Cluster 1 (left) and Cluster 2 (right) 
using the SVD coefficients. Cluster 1 ARs contain more patches with near neutral magnetic field 
values. 


cients, we see that their marginal distributions are similar across clusters, except for the coefficients 
that correspond to the second dictionary element of Figure 4. Recall that this second dictionary ele¬ 
ment is associated with the average magnetic field value of a patch. If the corresponding coefficient 
is close to zero, it means the average magnetic field in the patch is also close to zero. 

Figure 9 shows histograms of the coefficients of the second dictionary element. The histograms 
correspond to patches from all ARs separated by cluster assignment. The histograms show that 
Cluster 1 has a high concentration of patches with near zero average magnetic field. In contrast, 
the larger peaks for Cluster 2 are centered around +1 and -1. This suggests that the clustering 
assignments are influenced somewhat by the amount of patches in an AR that have near zero average 
magnetic field values. As we are considering only the core (sunspot) part of the AR, having 3 x 
3 patch with a near zero average magnetic field entails that the corresponding patch is likely to 
be located along the neutral line separating strong magnetic fields of opposite polarity. Thus the 
local distribution of magnetic field values is related to cluster assignments when using the SVD 
coefficients. This is consistent with Figure 8 where cluster 1 contains more of the complex ARs (J3y 
and /3 yd) and fewer simple ARs (a and /?) than cluster 2 as measured by the Mt. Wilson scheme. 

Checking the individual ARs and their coefficient distributions in each cluster, we see indeed that 
Cluster 1 does contain more ARs with patches having near zero average magnetic field. This tends 
to include more of the complex ARs in Cluster 1 since they are more likely to have a neutral line 
close to the regions of strong magnetic fields that will therefore be included in the STARA masks. 

It should be noted however that the correspondence is not perfect. There are some ARs in Cluster 
2 where the regions of opposing polarity are close to each other and some ARs in Cluster 1 where the 
regions of opposing polarity are far apart. Thus the distribution of average magnetic field values is 
only one factor in the natural geometry of the ARs defined by the Hellinger distance. As mentioned 
previously, the size of the AR is another factor, especially for /3 groups. This is consistent with 
Figure 8 where cluster 1 does contain some of the simple ARs which are less likely to have strong 
magnetic fields around the neutral line. 
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Investigating the joint histograms of the NMF dictionary elements corresponding to positive and 
negative magnetic field values reveals that the NMF Hellinger clustering results are also influenced 
by the local magnetic field distribution. 

All these observations indicate that the natural geometry exploited by both clustering configura¬ 
tions is related to some form of complexity of the ARs. 

5.2. Discussion of Sunspot Results 

We note that the Cluster 2 ARs containing the smallest sunspots are most similar to each other 
while the Cluster 1 ARs are more dissimilar (see Tables 8 and 9). This indicates that the Hellinger 
distance approaches are best for distinguishing between different types of larger or complex ARs. 

When NMF is applied on datasets where all values in the dictionary and coefficient matrices are 
constrained to be nonnegative, its results are generally more interpretable than SVD. In our ap¬ 
plication however, the magnetogram components can be negative. Hence the NMF results are not 
particularly sparse and lose some benefits of nonnegativity since the positive and negative magne¬ 
togram components can cancel each other. This results in some loss of interpretability. Additionally, 
the SVD results seem to be more interpretable due to separate treatment of the continuum and mag¬ 
netogram components. However, there is still some value in the NMF approach as we see that the 
clustering on NMF coefficients are better at separating the ARs by size than the SVD approach. 
Additionally, the NMF approaches tend to agree more strongly with the Mount Wilson labels than 
their SVD counterparts as is seen in Section 5.3 below. Future work could include using alternate 
forms of NMF such as in (2010) where sparsity and interpretability is preserved even 

when the dictionary is no longer constrained to be nonnegative. Another variation on coupled NMF 
that may be applicable is soft NMF ( , ) where the requirement that the two 

modalities share the same regression coefficients is relaxed somewhat. Finally, future work could 
perform factorization using a composite objective function comprised of two terms corresponding 
to the two modalities that are scaled according to their noise characteristics. 

5.3. Clustering with Neutral Line Data 

As described in Section 4.3, we analyze the effects of including data from the neutral line in the clus¬ 
tering. We proceed by taking a weighted average of the dissimilarities calculated from the sunspots 
and from the neutral line data matrices. Using the ARI, we compare the results to labels based on the 
Mount Wilson classification scheme, see Table 7 for the label definition. We use a grid of weights, 
starting from a weight of 0 for a clustering using only patches within sunspots up to a weight of 1 
for a clustering that takes into account only the neutral line data. 

Figure 10 plots the ARI for the four different schemes as a function of the weight. In nearly all 
cases, the ARI is above zero which indicates that the clustering does slightly better than a random 
assignment. In general, the correspondence of the clustering results with the Mount Wilson based 
labels decreases as the weight approaches 1. This means that the natural clusterings associated with 
only the neutral line data do not correspond as well with the Mount Wilson based labels. However in 
several cases, including some information from the neutral line at lower weights appears to increase 
the correspondence, e.g., the ARI increases for the 3 and 4 cluster cases for the SVD coefficients. 
This suggests that the neutral line and the sunspots contain information about AR complexity that 
may be different. 
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SVD, Hellinger 



Weight 


NMF, Hellinger 



Figure 10. Plot of the adjusted Rand index (ARI) using and Hellinger distances within the neutral 
line and sunspots as a function of the weight. A weight of 0 corresponds to clustering with only 
the sunspots while a weight of 1 clusters with only the neutral line. The different lines correspond 
to different numbers of clusters and the corresponding labels from Table 7. Higher ARI indicates 
greater correspondence. 


Note that clustering separates the ARs based on the natural geometry in the spaces we are con¬ 
sidering. Thus we can influence the clustering by choosing the space. For example, if we restrict 
our analysis to include only coefficients corresponding to specific dictionary patches, then this will 
influence the clustering. 

The gradients of magnetic field values across the neutral line is a key quantity used in several 
indicators of potential eruptive activity ( , ). We therefore repeated the neutral line 

experiment where we focused only on the gradients within the magnetogram as follows. When we 
applied SVD to the data matrix Z extracted from the neutral line, the resulting dictionary matrix 
A was very similar to that shown in Figure 4. Note that elements 3 and 4 correspond to the gra¬ 
dient patterns within the magnetogram data. Therefore, after learning A and H from Z, we kept 
only the coefficients corresponding to dictionary elements 3 and 4, i.e. the 3rd and 4th rows of H. 
We then estimated the Hellinger distance between the ARs’ underlying distributions of these two 
coefficients. This restricted the neutral line analysis to include only the coefficients corresponding 
to magnetogram gradients. For the data within the sunspots, we included all coefficients as before. 

Figure 11 shows the ARI as a function of the weight for this experiment. For all cases, the ARI 
stays fairly constant until the weight increases to 0.9, after which it drops dramatically. We can 
compare this to the results in Figure 10 (left) to determine if using only the neutral line gradient 
coefficients results in increased correspondence with the Mount Wilson labels relative to using 
all of the neutral line coefficients. From this comparison, the ARI is higher when using only the 
gradient components for weights greater than 0.1 and less than 1. Thus the correspondence with the 
Mount Wilson labels and the clustering is higher when we only include the magnetogram gradient 
coefficients. Since the Mount Wilson scheme is related to the complexity of the neutral line, this 
higher correspondence suggests that focusing on the gradients in the neutral line results in a natural 
geometry that is more closely aligned with the complexity of the neutral line than simply using all 
of the coefficients. Applying supervised techniques would lead to improved correspondence. 
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Figure 11. Plot of the ARI using the Hellinger distance within the neutral line on only the coeffi¬ 
cients corresponding to the SVD dictionary elements associated with magnetogram gradients. The 
corresponding dictionary elements are similar to the 3rd and 4th elements in Figure 4. Focusing on 
the gradients results in a higher ARI for higher weights than when all coefficients are used as seen 
in Figure 10. 


The clear patterns in the ARI indicate that the relationship between the weight and the ARI is 
unlikely to be due entirely to noise. Thus including data from the neutral line with the data from the 
sunspots would add value in an unsupervised setting and would likely lead to improved performance 
in a supervised setting. 

5.4. Clustering of Regions Exhibiting Strong Field Polarity Reversal Lines 

We now analyze ARs exhibiting strong field polarity reversal lines by comparing the natural clus¬ 
tering of these ARs to the calculated R value as described in Section 4.3. When we apply dimen¬ 
sionality reduction on this data using SVD, the resulting dictionary is very similar to Figure 4 with 
the first two patches consisting of uniform nonzero patches, the third and fourth patches consist¬ 
ing of gradients in the magnetogram, and the fifth and sixth patches consisting of gradients in the 
continuum. 

As before, the mean silhouette width indicates that the appropriate number of clusters is 2. When 
we cluster the ARs using the SVD coefficients corresponding to all six patches, we obtain a corre¬ 
lation between cluster assignment and R of 0.09 (see Table 10). This isn’t particularly high which 
suggests that the natural geometry based on the distribution of all six coefficients does not correlate 
well with R. However, since the clustering is separating the ARs based on the natural geometry 
in the spaces we are considering, we can influence the clustering by choosing the space. In other 
words, if we restrict our analysis to only coefficients corresponding to specific dictionary patches, 
then this will influence the clustering. 

Restricting the clustering analysis to the SVD coefficients corresponding only to the magne¬ 
togram components (i.e. elements 2, 3, and 4 in Figure 4) results in a correlation of 0.30 between 
cluster assignment and R value. If we only consider the gradient components (elements 3 and 4), 
then the correlation is 0.34. 
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Table 10. Magnitude of the correlation coefficient of the clustering assignment with either R or log R 
when using all of the coefficients, only the coefficients corresponding to the magnetogram compo¬ 
nent (SVD elements 2-4 in Figure 4), or only magnetogram gradient coefficients (SVD elements 
3-4 in Figure 4). For NMF, all of the dictionary elements are associated with the magnetogram and 
many of them have gradient components so we only perform clustering with all of the coefficients. 



SVD 

NMF 


All 

Mag. only 

Grad, only 

All 

R 

0.09 

0.30 

0.34 

0.15 

log R 

0.02 

0.37 

0.45 

0.08 


The relationship between cluster assignment and R may not be linear as the correlation between 
the clustering assignment using only the gradient components and log/? is 0.45. Comparing the 
magnetogram only components based clustering with log R similarly increases the correlation co¬ 
efficient. Given that clustering is an unsupervised method and that we are only clustering into 
two groups, this correlation is quite high. This suggests that the natural geometry of the image 
patch analysis increasingly corresponds with R as we restrict the analysis to magnetogram gra¬ 
dients. Supervised methods, such as regression, should lead to an even greater correspondence. 
Additionally, since the correlation is not perfect, this suggests that there is information in the image 
patch analysis that is not present in the R value which may be useful for AR analysis. 

For NMF, when we include all of the coefficients, the correlation between R and clustering assign¬ 
ment is 0.15. While this is small, we are again comparing the labels of an unsupervised approach 
to a continuum of values. Thus we can expect that the performance would be better in a supervised 
setting. If we compare the clusering to log/?, the correlation decreases to 0.08. It is difficult to re¬ 
strict the NMF dictionary to only continuum and magnetogram parts and gradients as most of the 
components contain a gradient component in the magnetogram. Therefore we only cluster the ARs 
using all NMF coefficients. 


6. Conclusion 

In this work, we introduce a reduced dimension representation of an AR that allows a data-driven 
unsupervised classification of ARs based on their local geometry. The ROI that surrounds and in¬ 
cludes the AR represents its most salient part and must be provided by the user. We used STARA 
masks in conjunction with masks situated around the neutral line, and compared our results with 
the Mount Wilson classification in order to ease interpretation of the unsupervised scheme. 

The Mount Wilson scheme focuses on the largest length scale when describing the geometrical 
arrangements of the magnetic field, whereas our method focuses on classifying ARs using infor¬ 
mation from fine length scale. We have shown that when we analyze and cluster the ARs based on 
the global statistics of the local properties, there are similarities to the classification based on the 
large scale characteristics. For example, when clustering using the Hellinger distance, one cluster 
contained most of the complex ARs. Other large scale properties such as the size of the AR also 
influenced the clustering results. Table 11 summarizes the properties that are found to influence the 
clustering under the two schemes. 
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Table 11. Summary of features distinguishing the clusters under the various classification schemes 
tested. 


Class. Scheme 

Cluster 1 

Cluster 2 

SVD Hellinger 

largest /?,/?y sunspots; majority of y 6y5\ 
high concentration of patches with 
average magnetic field value =* 0; 
large Hellinger distance between ARs 

smallest fi,/3y sunspots; high 
concentration of patches with average 
magnetic field value close to +1 or -1; 
small Hellinger distance between ARs 

NMF Hellinger 

largest a,fi,fiy sunspots; majority of /3yS\ 
large Hellinger distance between ARs 

smallest a,fi,fiy sunspots; 

small Hellinger distance between ARs 


In this comparison with the Mount Wilson scheme, we found that the STARA masks were some¬ 
times too restrictive which led to a mismatch between the Mount Wilson label and the extracted 
data. For example, there were several cases where an AR was labeled as a /? class but the STARA 
mask only extracted magnetic field values of one polarity. We showed that the neutral line contains 
additional information about the complexity of the AR. For this reason, we expect that including 
information beyond the STARA masks will lead to improved matching with the Mount Wilson 
labels. 

To investigate the possibility for our method to distinguish between potential and non-potential 
fields, we considered a ROI made of pixels situated along high-gradient, strong field polarity rever¬ 
sal lines. This is the same ROI as that used in the computation of the R value, which has proved 
useful in flare prediction in a supervised context. We found that our clustering was correlated with 
the R value, that is, the clustering based on the reduced dimension representation separates ARs 
corresponding to low R from the ones with large R. 

In future work, we plan to study the efficiency of supervised techniques applied to the reduced 
dimension representation. Supervised classification can always do at least as well as unsupervised 
learning in the task of reproducing class labels (e.g. Mount Wilson label). Indeed, in supervised 
classification, a training dataset with labels must be provided. A classifier (or predictor) is then 
built within the input feature space, and is used to provide a label to new observations. In con¬ 
trast, unsupervised classification or clustering separates the ARs naturally based on the geometry 
of the input feature space and does not use labels. Thus if the goal is for example to reproduce the 
Mount Wilson classes, or to detect nonpotentiality using global statistics of local properties, then 
supervised methods would lead to increased correspondence relative to our unsupervised results. 

In case of flare prediction, the labels would be some indicator of flare activity such as the strength 
of the largest flare that occurred within a specified time period after the image was taken. Supervised 
techniques such as classification or regression could be applied depending on the nature of the label 
(i.e. categorical vs. continuum). 

A good way of assessing how well a given feature space can do in a supervised setting is to 
estimate the Bayes error. The Bayes error gives the minimum average probability of error that any 
classifier can achieve on the given data and can be estimated in the two class setting by estimating 
upper and lower bounds such as the Chernoff bound using a divergence-based estimator as in 

( ). These bounds can be estimated using various schemes and combinations of 

data (inside sunspots, along the neutral line, etc.) to determine which scheme is best at reproducing 
the desired labels (e.g., the Mount Wilson labels). This can also be done in combination with phys- 
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ical parameters of the ARs such those used in ( ) (e.g. the total unsigned 

flux, the total area, the sum of flux near the polarity inversion line, etc.). 

These methods of comparing AR images can also be adapted to a time series of image pairs. For 
example, image pairs from a given point in time may be compared to the image pairs from an earlier 
period to measure how much the ARs have changed. The evolution of an AR may also be studied 
by defining class labels based on the results from one of the clustering schemes in this paper. From 
the clustering results, a classifier may be trained that is then used to assign an AR to one of these 
clusters at each time step. The evolution of the AR’s cluster assignment can then be examined. 
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Appendix A: Method Details 

A.l. Matrix Factorization 

As mentioned in Section 2, the goal of matrix factorization is to accurately decompose the 2 nr x n 
data matrix Z into the product of two matrices A (with size 2 m 2 x r) and H (with size r x n), where 
A has fewer columns than rows (r < 2m 2 ). The matrix A is the dictionary and the matrix H is the 
coefficient matrix. The columns of A form a basis for the data in Z. 

The two matrix factorization methods we use are singular value decomposition (SVD) and non¬ 
negative matrix factorization (NMF). These two methods can be viewed as solving two differ¬ 
ent optimization problems where the objective function is the same but the constraints differ. Let 
A = [ai, a 2 ,..., a r ] and H = [h 1; h 2 ,... h„]. For SVD, the optimization problem is 

min A , H ||Z-AH||| 

i • t 1 1, i = j . 

subject to a a, = < 

(0, i * j 

In words, SVD requires the columns of A to be orthonormal. 

For standard NMF, the optimization problem is 

min A ,H l|Z - AH|£ 
subject to a, > 0, V/ = 1,..., r , 
h, > 0, V/ = 1,... ,n 

where a > 0 applied to a vector a implies that all of a’s entries are greater than or equal to 0. In our 
problem, only the continuum is nonnegative so we only apply the constraint to the continuum part 
of the matrix A. So if a, and b, are both vectors with length nr corresponding to the continuum and 
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magnetogram parts, respectively, then we have A = 

constrains the columns of H to lie on a simplex, i.e. 
for our approach to NMF is 


ai a2 
bi b 2 


a, 

b, 


. The NMF method we use also 


Z /=i h,(/) = 1. Thus the optimization problem 


min A , H ||Z - AH||“ 
subject to a, >0, Vi = 1,..., r 
h, >0, Vi = 1, ... ,n ' 

Z'/ =1 h/0') = 1, Vi = 1 

This problem is not convex and is solved in an alternating manner by fixing H, finding the matrix A 
that solves the problem assuming H is fixed, and then solving for H while A is fixed. This process 
is repeated until the algorithm converges to a local minimum. See ( 7) for more details on 

the convergence analysis. 


A.2. The EAC-DC Clustering Method 

Let V = {vi, v 2 ,..., v,v} be a set of vertices and let E = {<?//}, where e tj denotes an edge between 
vertices v,, Vj, i,j e {1,..., N}, be a set of undirected edges between them. The pair (V, E) = G 
is the corresponding undirected graph. In our application, V corresponds to the set of AR image 
pairs being clustered and E contains all possible edges between the vertices. The weight of an edge 
e t j is defined as w tJ and measures the base dissimilarity between two vertices v, and v r In many 
applications, the base dissimilarity is the Euclidean distance. In our case, we use the Hellinger 
distance as the base dissimilarity measure. 

A spanning tree T of the graph G is a connected acyclic subgraph that passes through all N 
vertices of the graph and the weight of T is the sum of all the edge weights used to construct the 
tree, €r w,j. A minimal spanning tree of G is a spanning tree which has the minimal weight 

min T Ze.jtT Wij. 

Prim’s algorithm ( , ) is used by ( ) to construct the dual rooted MST. 

In Prim’s algorithm, the MST is grown sequentially where at each step, a single edge is added. This 
edge corresponds to the edge with minimal weight that connects a previously unconnected vertex 
to the existing tree. The root of the MST corresponds to the beginning vertex. For the dual rooted 
MST, we begin with two vertices v, and Vj and construct the minimal spanning trees 7) and Tj. At 
each step, the two edges that would grow both trees 7) and T l using Prim’s algorithm are proposed 
and the edge with minimal weight is added. This continues until 7) and Tj connect. The weight 
of the final edge added in this algorithm defines a new metric between the vertices v, and Vj. This 
process is repeated for all pairs of vertices and this new metric is used as input to spectral clustering 
(Talluccio et al., 201 ). 

A primary advantage of this metric based on the hitting time of the two MSTs is that it depends 
on the MST topology of the data. Thus if two vertices belong to the same cluster, then the MST 
distance between them will be small since cluster points will be close together. This is the case even 
if the vertices are far away from each other (e.g. on opposite ends of the cluster). However, if the 
two vertices are in different clusters that are well separated, then the MST distance between them 
will be large. See Figure A. 1 for an example. Thus this method of clustering is very robust to the 
shape of the clusters. ( ) contains many more examples. 
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Figure A.l. Dual rooted Prim tree built on a 2-dimensional data set when the roots are chosen from 
the same class (left) and different classes (right). The X’s mark the roots of the trees and the dashed 
line is the last connected edge. The length of the last connected edge is greater when the roots 
belong to clusters that are more separated. 

The MST based metric can be computationally intensive to compute as Prim’s algorithm must be 
run as many times as there are pairs of vertices. To counter this, ( ) proposed the 

EAC-DC algorithm which uses the information from only a subset of the dual rooted MSTs. This is 
done by calculating the dual rooted MSTs for a random pair of vertices. Three clusters are defined 
for each run: all vertices that are connected to one of the roots in the MSTs form two of the clusters 
(one for each root) while all points that are not connected to either of the MSTs are assigned to a 
third “rejection” cluster. A co-association measure for two vertices is then defined as the number of 
times those vertices are contained in the same non-rejection cluster divided by the total number of 
runs (dual rooted MSTs). This co-association measure forms a similarity measure to which spectral 
clustering is applied. 
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